Abstract

In science, the most widespread statistical quantities are perhaps p-values. A common piece of advice is to reject the null hypothesis $H_0$ if the corresponding p-value is sufficiently small (usually smaller than 0.05). Many criticisms regarding p-values have arisen in the scientific literature. The main issue raised is that p-values are not measures of evidence (support measures) over the parametric space $\Theta$. Here, we propose a frequentist measure of evidence for very general null hypotheses that satisfies logical requirements on the subsets of $\Theta$ that are not met by p-values (e.g., it is a possibilistic measure). We study the proposed measure in the light of the abstract belief calculus formalism and we conclude that it can be used to establish objective states of belief on the subsets of $\Theta$. Based on its properties, we strongly recommend this measure as an additional summary of significance tests. At the end of the paper we give a short list of possible open problems.

Keywords: Abstract belief calculus, evidence measure, likelihood-based con- 
fidence, nested hypothesis, p-value, possibilistic measure, significance test 



1 Introduction 



Tests of significance are subjects of intense debate and discussion among many statisticians (Kempthorne, 1976; Cox, 1977; Berger and Sellke, 1987; Aitkin, 1991; Schervish, 1996; Royall, 1997; Mayo and Cox, 2006), scientists in general (Dubois and Prade, 1990; Darwiche and Ginsberg, 1992; Friedman and Halpern, 1996; Wagenmakers, 2007) and philosophers of science (Stern, 2003; Mayo, 2004). In this paper, we discuss some limitations of p-values (which is a well-explored territory) and we propose an alternative measure to establish objective states of belief on the subsets of the full parametric space $\Theta$. Currently, many scholars have been studying the controversies and limitations of p-values (see, for instance, Mayo and [...], 2012; Diniz et al., 2012) and others have proposed some alternatives (Zhang, 2009; [...], 2010). In this paper, besides proposing a classical (read frequentist) measure of evidence, we also provide a connection with the abstract belief calculus (ABC) proposed by Darwiche and Ginsberg (1992), which certifies the status of "objective state of belief" for our proposal; see Section 4 for specific details.

A procedure that measures the consistency of observed data $x$ (the capital letter $X$ denotes the random quantity) with a null hypothesis $H_0$ is known as a significance test (Kempthorne, 1976; Cox, 1977). According to Mayo and Cox (2006), to do this in the frequentist paradigm, we may find a function $t = t(x)$, called a test statistic, such that: (1) the larger the value of $t$, the more inconsistent the data are with $H_0$; and (2) the random variable $T = t(X)$ has a known probability distribution when $H_0$ is true (at least asymptotically). The p-value related to the statistic $T$ (the observed level of significance; Cox, 1977) is then computed as

$$p = P(T > t; \text{ under } H_0),$$

which is regarded as a measure of concordance with $H_0$ (Mayo and Cox, 2006). A p-value is the probability that an unobserved $T$ is at least as extreme as the observed $t$ when $H_0$ is true. This means that small values of $p$ indicate a discordance between the probabilistic model of the data and the one specified in $H_0$. It is widely agreed to set in advance a threshold value $\alpha$ and to reject $H_0$ if and only if $p < \alpha$. Pereira and Wechsler (1993) point out some problems that arise when the statistic $T$ does not consider the alternative hypothesis. Schervish (1996) argued that p-values, as measures of evidence for hypotheses, have serious logical flaws. Berger and Sellke (1987) argue that p-values can be highly misleading measures of the evidence provided by the data against the null hypothesis. In this paper, we present two examples where conflicting conclusions arise if p-values are used to make decisions regarding the inadequacy of a hypothesis. Then, we propose a measure of evidence for general null hypotheses that is free of those conflicts and has some important philosophical implications in the frequentist paradigm, which will be detailed in future works.

Here, the null hypothesis is defined in a parametric context. Let $\theta \in \Theta \subseteq \mathbb{R}^k$ be the model parameter; the null hypothesis is defined as $H_0 : \theta \in \Theta_0$, i.e., if $H_0$ is true, then the true unknown value of the parameter vector lies in the subset $\Theta_0 \subset \Theta$. The parametric space of the alternative hypothesis will always be the complement of $\Theta_0$ with respect to $\Theta$; therefore, if the alternative hypothesis is specified, the full parametric space has to take it into account. A hypothesis test attempts to reduce a family of possible measures that governs the data behavior, say $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$, to a more restricted one, say $\mathcal{P}_0 = \{P_\theta : \theta \in \Theta_0\}$. There are many ways to find a test statistic; the choice essentially depends on the topologies of $\Theta_0$ and $\Theta$. When $\Theta_0$ and its complement have one element each, the Neyman-Pearson Lemma provides the most powerful test (which is based on the likelihood ratio statistic) for any pre-fixed significance value. Naturally, we can use this statistic to compute a p-value. For the general case, the generalized likelihood ratio statistic (which will be called simply the likelihood ratio statistic) is given by

$$\lambda(x) = \frac{\sup_{\theta \in \Theta_0} L(\theta, x)}{\sup_{\theta \in \Theta} L(\theta, x)},$$

where $L(\theta, x)$ is the likelihood function. Observe that the likelihood ratio statistic does take into account the alternative hypothesis (since $\Theta = \Theta_0 \cup \Theta_1$, with $\Theta_1$ being the parametric space defined in the alternative hypothesis). The reader should notice that the likelihood ratio statistic is a generalization of uniformly most powerful tests (see Birkes, 1990) under general hypothesis testing. For a general linear hypothesis, $H_0 : C\theta = d$, we can also resort to a Wald-type statistic

$$W(x) = (C\hat{\theta} - d)^\top [C A C^\top]^{-1} (C\hat{\theta} - d),$$

with $\hat{\theta}$ being a consistent estimator that, under $H_0$, is (asymptotically) normally distributed and $A$ its (asymptotic) covariance matrix computed at $\hat{\theta}$. These two statistics have many important properties and are widely used in actual problems. Suppose that $X = (X_1, \ldots, X_n)$ is an independent and identically distributed (iid) random sample. Under some regularity conditions on $L(\theta, X)$ and when $\Theta_0$ is a smooth (semi)algebraic manifold, it is well known that, under $H_0$, $T(X) = -2\log(\lambda(X))$ converges in distribution to a chi-square distribution with $s$ degrees of freedom (from now on denoted simply by $\chi^2_s$), where $s = \dim(\Theta) - \dim(\Theta_0)$ is the co-dimension of $\Theta_0$. The asymptotic distribution of $W(X)$ is a chi-square with $\mathrm{rank}(C)$ degrees of freedom, which is the very same as that of the likelihood ratio statistic for general linear null hypotheses. We can also mention the score test statistic that, under appropriate conditions, has asymptotically the same distribution as the two previous statistics. That is, different p-values can be computed for the same hypothesis-testing problem by using different procedures. In this paper we shall only use the procedure based on the (generalized) likelihood ratio statistic, since it has optimal asymptotic properties (see Bahadur and Raghavachari, 2009). From now on, "asymptotic p-values" means p-values computed by using the asymptotic distribution of the test statistic.

Sometimes practitioners have to test a complicated hypothesis $H_{01}$. For ease of computation, instead of testing $H_{01}$, they may think of testing another auxiliary hypothesis $H_{02}$ such that if $H_{02}$ is false then $H_{01}$ is also false. This procedure is used routinely in medicine and health fields in general; e.g., in genetic studies one of the interests is to compare genotype frequencies between two groups. In this example, we know that $H_{01}$: "homogeneity of genotype frequencies between the groups" implies $H_{02}$: "homogeneity of allelic frequencies between the groups". By using mild logical requirements (the consequence condition), if we find evidence against $H_{02}$ we expect to claim evidence against $H_{01}$. However, as is widely known, p-values do not follow this logical reasoning.

Let $H_{01} : \theta \in \Theta_{01}$ and $H_{02} : \theta \in \Theta_{02}$ be two null hypotheses such that $\Theta_{01} \subset \Theta_{02}$, i.e., $H_{01}$ is nested within $H_{02}$. Logical reasoning leads us to expect more evidence against $H_{01}$ than against $H_{02}$ for the same observed data $X = x$. In other words, if $p_i$ is the asymptotic p-value computed under $H_{0i}$, for $i = 1, 2$, respectively, then we expect to observe $p_1 < p_2$. However, if the dimensions of the spaces described in these nested hypotheses are different, then their respective asymptotic p-values will be computed under different metrics and therefore inverted conclusions may eventually occur, i.e., more disagreement with $H_{02}$ than with $H_{01}$ (i.e., $p_2 < p_1$). That is to say, for given data and a preassigned $\alpha$, it may happen that $p_1 > \alpha$ and $p_2 < \alpha$. One may therefore be confronted at the same time with "evidence" to reject $H_{02}$ and no "evidence" to reject $H_{01}$. Of course, the problem here is not with the approximation of the p-values computed by using a limiting reference distribution; the problem also happens with the exact ones. The example below illustrates the above considerations for exact p-values in a multiparametric scenario.

Example 1.1. Consider an independent and identically distributed random sample $X = (X_1, \ldots, X_n)$, where $X_1 \sim N_2(\mu, I)$ with $\mu = (\mu_1, \mu_2)^\top$ and $I$ is the $(2 \times 2)$ identity matrix. The full parametric space is $\Theta = \{(\mu_1, \mu_2) : \mu_1, \mu_2 \in \mathbb{R}\} = \mathbb{R}^2$. For this example we consider two particular hypotheses. Firstly, suppose that we want to test $H_{01} : \theta \in \Theta_{01}$, where $\Theta_{01} = \{(0, 0)\}$; then the likelihood ratio statistic is

$$\lambda_1(X) = \exp\left(-\frac{n}{2} \bar{X}^\top \bar{X}\right),$$

where $\bar{X}$ is the sample mean. Taking $T_1(X) = -2\log(\lambda_1(X))$ we know that, under $H_{01}$, $T_1 \sim \chi^2_2$. Secondly, suppose that the null hypothesis is $H_{02} : \theta \in \Theta_{02}$, where $\Theta_{02} = \{(\mu_1, \mu_2) : \mu_1 = \mu_2, \ \mu_1, \mu_2 \in \mathbb{R}\}$; the likelihood ratio statistic is

$$\lambda_2(X) = \exp\left(-\frac{n}{2} \bar{X}^\top \left(I - \frac{1}{2} l l^\top\right) \bar{X}\right),$$

where $l = (1, 1)^\top$. Taking $T_2(X) = -2\log(\lambda_2(X))$ it is possible to show that, under $H_{02}$, $T_2 \sim \chi^2_1$. Notice that, in this example, the Wald statistics for the two null hypotheses $H_{01}$ and $H_{02}$ are equal to $T_1$ and $T_2$, respectively. Assume that the sample size is $n = 100$ and the observed sample mean is $\bar{x} = (0.14, -0.16)^\top$; then $T_1(x) = 4.52$ (with p-value $p_1 = 0.10$) and $T_2(x) = 4.5$ (with p-value $p_2 = 0.03$). These p-values show evidence against $\mu_1 = \mu_2$, but not against $\mu_1 = \mu_2 = 0$. However, if we reject that $\mu_1 = \mu_2$ we should technically reject that $\mu_1 = \mu_2 = 0$ (using the very same data).
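The numbers quoted in Example 1.1 can be checked with a few lines of Python. This is only a sketch under the example's own assumptions (unit covariance, $n = 100$, the stated sample mean); the helper name `chi2_sf` is ours and covers just the two degrees of freedom needed here, using the closed-form chi-square tail probabilities.

```python
import math

def chi2_sf(t, df):
    """Survival function P(chi2_df > t) for df = 1 or 2 (closed forms)."""
    if df == 1:
        return math.erfc(math.sqrt(t / 2.0))
    if df == 2:
        return math.exp(-t / 2.0)
    raise ValueError("only df = 1 or 2 are supported in this sketch")

n = 100
xbar = (0.14, -0.16)

# H01: mu = (0, 0); T1 = n * xbar' xbar, chi-square with 2 df under H01
t1 = n * (xbar[0] ** 2 + xbar[1] ** 2)

# H02: mu1 = mu2; the restricted MLE is the average of the two sample means
m = (xbar[0] + xbar[1]) / 2.0
t2 = n * ((xbar[0] - m) ** 2 + (xbar[1] - m) ** 2)

p1 = chi2_sf(t1, 2)
p2 = chi2_sf(t2, 1)
print(round(t1, 2), round(p1, 2))
print(round(t2, 2), round(p2, 2))
```

Although the two statistics are almost identical, they are referred to different reference distributions, which is what produces $p_2 < 0.05 < p_1$.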

This issue does not happen only with the likelihood ratio statistic; it happens with many other classical test statistics (score and others) that consider how data should behave under $H_0$. As p-values are just probabilities of finding unobserved statistics at least as large as the observed ones, the conflicting conclusion presented in the above example is not a logical contradiction of the frequentist method. This issue happens because a p-value was not designed to be a measure of evidence over subsets of $\Theta$. We must say that p-values do exactly the job they were defined to do. However, in the practical scientific world, researchers use p-values to make decisions and, hence, they may eventually face some problems with the consistency of their conclusions. P-values must therefore be used with caution when making decisions about a null hypothesis. The example below presents a data set which produces surprising conclusions for regression models.

Example 1.2. Consider a linear model $y = xb + e$, where $b = (b_1, b_2)^\top$ is a vector formed by two regression parameters, $x = (x_1, x_2)$ is an $(n \times 2)$ matrix of covariates and $e \sim N_n(0, I_n)$, with $I_n$ the $n \times n$ identity matrix. It is usual to verify whether each component of $b$ is equal to zero and to remove the non-significant parameters from the model. The majority of statistical routines present the p-values for $H_{0i} : b_i = 0$, say $p_i$, for $i = 1, 2$. However, sometimes both p-values are greater than $\alpha$ and there exists a joint effect that cannot be discarded. As these hypotheses include a more restricted one, $H_{03} : b = 0$, the general advice is to remove both $b_1$ and $b_2$ only if the p-value $p_3$, computed for $H_{03}$, is also greater than $\alpha$, that is, only if $H_{03}$ is not rejected (this decision obeys the logical reasoning). We expect to observe more evidence against $H_{03}$ than against either $H_{01}$ or $H_{02}$. In fact, almost always the p-value $p_3$ is smaller than both $p_1$ and $p_2$, as expected. However, as we shall see below, an inversion of conclusions may occur. To see that, let us present the main ingredients. The maximum likelihood estimator of $b$ is $\hat{b} = (x^\top x)^{-1} x^\top y$ and the likelihood ratio statistics for testing $H_{01}$, $H_{02}$ and $H_{03}$ are, respectively,

$$\lambda_i(y, x) = \exp\left(-\frac{1}{2} \hat{b}^\top x^\top \left(I_n - \tilde{x}_i (\tilde{x}_i^\top \tilde{x}_i)^{-1} \tilde{x}_i^\top\right) x \hat{b}\right) \ \text{ for } i = 1, 2, \quad \text{and} \quad \lambda_3(y, x) = \exp\left(-\frac{1}{2} \hat{b}^\top x^\top x \hat{b}\right),$$

where $\tilde{x}_1 = x_2$ and $\tilde{x}_2 = x_1$. It can be shown that $T_i(Y, x) = -2\log(\lambda_i(Y, x)) \sim \chi^2_{s_i}$ for all $i = 1, 2, 3$, where, for this example, $s_1 = s_2 = 1$ and $s_3 = 2$. Again, although $T_3 > T_i$, for $i = 1, 2$, the metrics used to compute the p-values are different and odd behavior may arise, as we notice in the following data:

y     -1.29   1.09  -0.16   0.44  -0.22  -1.85   0.91   0.54   0.06   0.37
x_1    3.00   8.00   5.00   9.00  10.00   1.00   6.00   9.00   6.00   5.00
x_2    9.00   7.00   7.00  10.00   7.00   8.00   6.00   6.00   3.00   2.00

Here, the three statistics are $t_1 = 4.48$ (with p-value $p_1 = 0.03$), $t_2 = 4.00$ (with p-value $p_2 = 0.045$) and $t_3 = 4.59$ (with p-value $p_3 = 0.10$). For these data, we have problems with the conclusion, since we expected to have much more evidence against $H_{03}$ than against $H_{01}$ and $H_{02}$. Notice that $p_3/p_1 \approx 3.3$ and $p_3/p_2 \approx 2.2$.
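The statistics of Example 1.2 can be reproduced directly from the ten observations. The sketch below assumes, as in the example, a known unit error variance and no intercept, so that each statistic is a difference of fitted sums of squares; the helper `dot` and the variable names are ours and are not part of any statistical package.

```python
import math

y  = [-1.29, 1.09, -0.16, 0.44, -0.22, -1.85, 0.91, 0.54, 0.06, 0.37]
x1 = [3.0, 8.0, 5.0, 9.0, 10.0, 1.0, 6.0, 9.0, 6.0, 5.0]
x2 = [9.0, 7.0, 7.0, 10.0, 7.0, 8.0, 6.0, 6.0, 3.0, 2.0]

def dot(u, v):
    """Inner product of two equally long sequences."""
    return sum(a * b for a, b in zip(u, v))

# Entries of x'x and x'y for the two-column design matrix x = (x1, x2)
s11, s22, s12 = dot(x1, x1), dot(x2, x2), dot(x1, x2)
g1, g2 = dot(x1, y), dot(x2, y)

# t3 = (x'y)' (x'x)^{-1} (x'y): fitted sum of squares of the full model (tests b = 0)
det = s11 * s22 - s12 ** 2
t3 = (s22 * g1 ** 2 - 2.0 * s12 * g1 * g2 + s11 * g2 ** 2) / det

# Setting b1 = 0 leaves only x2 in the model, and vice versa
t1 = t3 - g2 ** 2 / s22   # tests b1 = 0
t2 = t3 - g1 ** 2 / s11   # tests b2 = 0

p1 = math.erfc(math.sqrt(t1 / 2.0))  # chi-square tail, 1 df
p2 = math.erfc(math.sqrt(t2 / 2.0))  # chi-square tail, 1 df
p3 = math.exp(-t3 / 2.0)             # chi-square tail, 2 df
print(round(t1, 2), round(t2, 2), round(t3, 2))
print(round(p1, 3), round(p2, 3), round(p3, 3))
```

The joint statistic is the largest of the three, yet its p-value is also the largest, because it is the only one measured against a chi-square with two degrees of freedom.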

Many other examples in higher dimensions can be built, but we think that these two instances are sufficient to illustrate the weakness of p-values when it comes to deciding on the acceptance or rejection of specific hypotheses; for other examples we refer the reader to Schervish (1996). Observe that we used the very same procedure to test both hypotheses $H_{01}$ and $H_{02}$ (i.e., likelihood ratio statistics). Some scientists and practitioners would become confused by these results and it would be very difficult to explain to them the reason for that. We believe that the development of a true measure of evidence for null hypotheses that does not have these problems would be welcomed by the scientific community.

In summary: in usual frequentist significance tests, a general method of computing test statistics can be used (likelihood ratio statistics, Wald-type statistics, score statistics and so forth). The distribution of the chosen test statistic depends on the null hypothesis and this leads to different metrics in the computation of p-values (this is the major factor that gives the basis for the frequentist interpretations of p-values). As each of these metrics depends on the dimension of the respective null hypothesis, conflicting conclusions may arise for nested hypotheses. In the next section, we present a new measure that can be regarded as a measure of evidence for null hypotheses without committing any logical contradictions.

This paper unfolds as follows. In Section 2 we present a definition of evidence measure and propose a frequentist version of this measure. Some of its properties are presented in Section 3. A connection with the abstract belief calculus is shown in Section 4. Examples are offered in Section 5. Finally, in Section 6 we discuss the main results and present some final remarks.



2 An evidence measure for null hypotheses 



In this section we define a very general procedure to compute a measure of evidence for $H_0$. The concept of evidence was discussed by Good (1983) in great philosophical detail. We also refer the reader to Royall (1997) and its review (Vieland et al., 1998) for relevant arguments to develop new methods of measuring evidence. As in the previous section, $H_0 : \theta \in \Theta_0$ is the null hypothesis, where $\Theta_0 \subset \Theta \subseteq \mathbb{R}^k$ is a smooth manifold. Below, we define what we mean by an evidence measure.

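Definition 2.1. Let $\mathcal{X} \subseteq \mathbb{R}^n$ be the sample space and let $\mathcal{P}(\Theta)$ denote the family of subsets of $\Theta$. A function $q : \mathcal{X} \times \mathcal{P}(\Theta) \to [0, 1]$ is a measure of evidence of null hypotheses (we shall write just $q(\Theta_0) = q(X, \Theta_0)$ for brevity) if the following items hold:

1. $q(\emptyset) = 0$ and $q(\Theta) = 1$,

2. For any two null hypotheses $H_{01} : \theta \in \Theta_{01}$ and $H_{02} : \theta \in \Theta_{02}$ such that $\Theta_{01} \subseteq \Theta_{02}$, we must have $q(\Theta_{01}) \le q(\Theta_{02})$.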

The above definition is the least we would expect from a coherent measure of evidence. Items 1 and 2 of Definition 2.1 describe a plausibility measure (Friedman and Halpern, 1996). Plausibility measures generalize probability measures, since:

"A plausibility measure associates with a set a plausibility, which is just an element in a partially ordered space" (Friedman and Halpern, 1996).

As shown in the previous section, p-values are not even plausibility measures on $\Theta$, since Condition 2 of Definition 2.1 is not satisfied. Therefore, they cannot be regarded as measures of evidence. Bayes factors are also not plausibility measures on $\Theta$, i.e., Condition 2 fails to hold (see Lavine and Schervish, 1999; Bickel, 2010). In order to find a purely frequentist measure of evidence, with prior distributions over neither $\Theta$ nor $H_0$, we define a likelihood-based confidence region as follows.

Definition 2.2. A likelihood-based confidence region with level $\alpha$ is

$$\Lambda_\alpha = \{\theta \in \Theta : 2(\ell(\hat{\theta}) - \ell(\theta)) \le F_\alpha\},$$

where $\hat{\theta} = \arg\sup_{\theta \in \Theta} \ell(\theta)$ is the maximum likelihood estimator, $\ell$ is the log-likelihood function and $F_\alpha$ is the $1 - \alpha$ quantile computed from a cumulative distribution function $F$, i.e., $F(F_\alpha) = 1 - \alpha$. Here, $F$ is (an approximation for) the cumulative distribution function of the random variable $2(\ell(\hat{\theta}) - \ell(\theta_0))$, where $\theta_0$ is the true value.

Consider the following conditions:

C1. $\hat{\theta}$ is an interior point of $\Theta$,

C2. $\ell$ is strictly concave.

Below we define an evidence measure for the null hypothesis $H_0$.

Definition 2.3. Let $\Lambda_\alpha$ be the likelihood-based confidence region. An evidence measure for the null hypothesis $H_0 : \theta \in \Theta_0$ is the value

$$q = q(\Theta_0) = q(X, \Theta_0) = \max\{0, \sup\{\alpha \in (0, 1) : \Lambda_\alpha \cap \Theta_0 \ne \emptyset\}\}.$$

We shall call it the q-value for short.

In words, $q$ is the greatest significance level for which at least one point of $\Theta_0$ lies inside the confidence region for $\theta$. A simple way of computing a q-value is to build high-confidence regions for $\theta$ that include at least one point of $\Theta_0$ and gradually decrease the confidence until the border of the $(1 - q) \times 100\%$ confidence region intercepts just the last point(s) of $\Theta_0$. The value $q$ is such that the $(1 - q - \delta) \times 100\%$ confidence region does not include any points of $\Theta_0$, for any $\delta > 0$. Figure 1 illustrates some confidence regions $\Lambda_\alpha$ for $\theta = (\theta_1, \theta_2)$ considering different values of $\alpha$. The dotted line is $\Lambda_{q_1}$, where $q_1$ is the q-value for testing $H_{01} : \theta_1 = 0$. The dot-dashed line is $\Lambda_{q_2}$, where $q_2$ is the q-value for testing $H_{02} : \theta_2 = 0$. The dashed line is $\Lambda_{q_3}$, where $q_3$ is the q-value for testing $H_{03} : \theta_1 + \theta_2 = 1$.

Observe that a large value of $q(\Theta_0)$ indicates that there exists at least one point in $\Theta_0$ that is near the maximum likelihood estimator, that is, the data are not discrediting the null hypothesis $H_0$. Conversely, a small value of $q(\Theta_0)$ means that all points of $\Theta_0$ are far from the maximum likelihood estimator, that is, the data are discrediting the null hypothesis. The metric that says what is near or far from $\hat{\theta}$ is the (asymptotic) distribution of $2(\ell(\hat{\theta}) - \ell(\theta_0))$. These statements are readily seen by drawing confidence regions (or intervals) with different confidence levels; see Figure 1.
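The shrinking-region description above can be turned into a small numerical procedure. The sketch below is an illustration under our own assumptions, not the paper's algorithm: it reuses the bivariate normal setting of Example 1.1 with $H_{02} : \mu_1 = \mu_2$, takes $F$ to be the chi-square distribution with 2 degrees of freedom (so that $F_\alpha = -2\log\alpha$), replaces $\Theta_0$ by a finite grid, and finds the supremum in Definition 2.3 by bisection on $\alpha$.

```python
import math

n = 100
xbar = (0.14, -0.16)

def lr_stat(mu):
    """2 * (loglik at the MLE - loglik at mu) for the N2(mu, I) model."""
    return n * ((xbar[0] - mu[0]) ** 2 + (xbar[1] - mu[1]) ** 2)

# Finite grid standing in for Theta_0 = {(m, m)}: an approximation
theta0_grid = [(i / 1000.0, i / 1000.0) for i in range(-1000, 1001)]
stats = [lr_stat(mu) for mu in theta0_grid]

def region_hits_theta0(alpha):
    """True if the level-alpha likelihood region contains a grid point."""
    f_alpha = -2.0 * math.log(alpha)   # (1 - alpha) quantile of chi-square, 2 df
    return any(s <= f_alpha for s in stats)

# Bisection for sup{alpha in (0, 1): region intersects Theta_0}
lo, hi = 0.0, 1.0
for _ in range(60):
    mid = (lo + hi) / 2.0
    if region_hits_theta0(mid):
        lo = mid
    else:
        hi = mid
q = max(0.0, lo)
print(round(q, 4))
```

The bisection is valid because the regions are nested: once a region misses $\Theta_0$, every smaller region misses it as well. In this setting the grid contains the closest point of the line to $\bar{x}$, so the result agrees with $\exp(-4.5/2)$.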



Bickel (2010) developed a method based on the law of likelihood to quantify the weight of evidence for one hypothesis over another. Here, we proposed a classical possibilistic measure over $\Theta$ based on likelihood-based confidence regions; see Definition 2.3. Although these approaches are based on similar concepts, they capture different values from the data (we do not investigate this further in the present paper). Another evidence measure, which is a Bayesian competitor, is the FBST (Full Bayesian Significance Test) proposed originally by Pereira and Stern (1999). See also a version that is invariant under reparametrizations in Madruga et al. (2003); we refer the reader to Pereira et al. (2008) for an extensive review of this latter method.
3 Some important properties



In this section we show some important properties of q-values that will be used to connect them with possibilistic measures and the abstract belief calculus (see Section 4). Our first theorem states that item 1 of Definition 2.1 holds for the proposed q-value.

Theorem 3.1. Let $q$ be a q-value and consider condition C1; then $q(\emptyset) = 0$ and $q(\Theta) = 1$.

Proof. As $\hat{\theta} \in \Theta$ (see condition C1), we have that $\{\alpha \in (0, 1) : \Lambda_\alpha \cap \Theta \ne \emptyset\} = (0, 1)$ and then $q(\Theta) = 1$. Also, $\{\alpha \in (0, 1) : \Lambda_\alpha \cap \emptyset \ne \emptyset\} = \emptyset$; then $q(\emptyset) = \max\{0, \sup(\emptyset)\} = 0$. □

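The following theorem completes the requirement for the q-value to be a measure of evidence.

Theorem 3.2. (Nested hypotheses) For fixed data $X = x$, let $H_{01} : \theta \in \Theta_{01}$ and $H_{02} : \theta \in \Theta_{02}$ be two null hypotheses such that $\Theta_{01} \subseteq \Theta_{02}$. Then $q(\Theta_{01}) \le q(\Theta_{02})$, where $q(\Theta_{01})$ and $q(\Theta_{02})$ are evidence measures for $H_{01}$ and $H_{02}$, respectively.

Proof. Observe that if $\Theta_{01} \subseteq \Theta_{02}$, then $\Lambda_\alpha \cap \Theta_{01} \subseteq \Lambda_\alpha \cap \Theta_{02}$ and $\{\alpha \in (0, 1) : \Lambda_\alpha \cap \Theta_{01} \ne \emptyset\} \subseteq \{\alpha \in (0, 1) : \Lambda_\alpha \cap \Theta_{02} \ne \emptyset\}$. We conclude that $q(\Theta_{01}) \le q(\Theta_{02})$ for all $\Theta_{01} \subseteq \Theta_{02} \subseteq \Theta$. □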

Another important feature of our proposal is its invariance under reparametrizations. As likelihood-based confidence regions are invariant under reparametrizations, the q-value is also invariant. Based on Theorem 3.2 we can now establish an interesting result related to the Burden of Proof, namely, that the evidence in favour of a composite hypothesis is the most favourable evidence in favour of its terms (Stern, 2003).



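Lemma 3.1. (Most Favourable Interpretation) Let $I$ be a countable or uncountable real subset and assume that $\Theta_0 = \bigcup_{i \in I} \Theta_{0i}$ is nonempty; then $q(\Theta_0) = \sup_{i \in I}\{q(\Theta_{0i})\}$.

Proof. By Theorem 3.2 we know that $q(\Theta_0) \ge q(\Theta_{0i})$ for all $i \in I$; then $q(\Theta_0) \ge \sup_{i \in I}\{q(\Theta_{0i})\}$. To prove this lemma, we must show that $q(\Theta_0) \le \sup_{i \in I}\{q(\Theta_{0i})\}$. Define $A(B) = \{\alpha \in (0, 1) : \Lambda_\alpha \cap B \ne \emptyset\}$ and note that $A(\Theta_0) \subseteq \bigcup_{i \in I} A(\Theta_{0i})$. Therefore,

$$\sup(A(\Theta_0)) \le \sup\Big(\bigcup_{i \in I} A(\Theta_{0i})\Big) \ \Rightarrow \ \sup(A(\Theta_0)) \le \sup_{i \in I}\{\sup(A(\Theta_{0i}))\}$$

and $q(\Theta_0) = \sup_{i \in I}\{q(\Theta_{0i})\}$. □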



Lemma 3.1 states that q-values are possibilistic measures on $\Theta$ (Dubois and Prade, 1990; Dubois, 2006). The next lemma presents an important result (for strictly concave log-likelihood functions), which allows us to connect q-values with p-values.
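Lemma 3.2. Assume that conditions C1 and C2 hold. For a nonempty $\Theta_0$ and $F$ strictly increasing, the q-value can alternatively be defined as

$$q(\Theta_0) = 1 - F(2(\ell(\hat{\theta}) - \ell(\hat{\theta}_0))), \qquad (1)$$

where $\hat{\theta}_0 = \arg\sup_{\theta \in \Theta_0} \ell(\theta)$.

Proof. If $\Theta_0$ is nonempty, there exists $\alpha \in (0, 1)$ such that $\Lambda_\alpha \cap \Theta_0 \ne \emptyset$ and the q-value is just $q = \sup\{\alpha \in (0, 1) : \Lambda_\alpha \cap \Theta_0 \ne \emptyset\}$. Notice that, as $F$ is strictly increasing, we have that

$$\Lambda_\alpha \cap \Theta_0 = \{\theta \in \Theta_0 : 2(\ell(\hat{\theta}) - \ell(\theta)) \le F_\alpha\}$$
$$= \{\theta \in \Theta_0 : F(2(\ell(\hat{\theta}) - \ell(\theta))) \le 1 - \alpha\}$$
$$= \{\theta \in \Theta_0 : 1 - F(2(\ell(\hat{\theta}) - \ell(\theta))) \ge \alpha\}.$$

As $\ell$ is strictly concave, for all $0 < \alpha < \alpha' < 1$ we have $\Lambda_{\alpha'} \cap \Theta_0 \subseteq \Lambda_\alpha \cap \Theta_0$ and

$$\sup\{\alpha \in (0, 1) : \Lambda_\alpha \cap \Theta_0 \ne \emptyset\} = \sup_{\theta \in \Theta_0}\{1 - F(2(\ell(\hat{\theta}) - \ell(\theta)))\}$$

and

$$\sup_{\theta \in \Theta_0}\{1 - F(2(\ell(\hat{\theta}) - \ell(\theta)))\} = 1 - F(2(\ell(\hat{\theta}) - \ell(\hat{\theta}_0))),$$

where $\hat{\theta}_0 = \arg\sup_{\theta \in \Theta_0} \ell(\theta)$. □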

The value of $\hat{\theta}_0$ can be seen as the point of $\Theta_0$ that lies on the border of the $(1 - q) \times 100\%$ confidence region for $\theta$. Notice that if $F$ is merely a non-decreasing function then we can only claim that

$$\{\theta \in \Theta_0 : 2(\ell(\hat{\theta}) - \ell(\theta)) \le F_\alpha\} \subseteq \{\theta \in \Theta_0 : F(2(\ell(\hat{\theta}) - \ell(\theta))) \le 1 - \alpha\};$$

the converse inclusion may not be valid. In general, $F$ can be approximated by a chi-square distribution with $k$ degrees of freedom, where $k$ is the dimension of $\Theta$ (this is a strictly increasing function). Based upon this alternative version we can directly compare q-values with p-values. In addition, it is possible to derive the distribution of $q$.
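As an illustration of formula (1), the q-values for the hypotheses of Examples 1.1 and 1.2 can be obtained from the likelihood ratio statistics already reported there. The sketch assumes $F$ is the chi-square distribution with $k = 2$ degrees of freedom, since $\dim(\Theta) = 2$ in both examples, for which $1 - F(t) = \exp(-t/2)$; the function name `q_value` is ours.

```python
import math

def q_value(lr_stat_value):
    """Formula (1) with F = chi-square with k = 2 df: q = exp(-T / 2)."""
    return math.exp(-lr_stat_value / 2.0)

# Example 1.1: T1 = 4.52 (H01: mu = 0), T2 = 4.50 (H02: mu1 = mu2); H01 is nested in H02
q_11 = {"H01": q_value(4.52), "H02": q_value(4.50)}

# Example 1.2: t1 = 4.48 (b1 = 0), t2 = 4.00 (b2 = 0), t3 = 4.59 (b = 0, nested in both)
q_12 = {"H01": q_value(4.48), "H02": q_value(4.00), "H03": q_value(4.59)}

for name, qs in (("Example 1.1", q_11), ("Example 1.2", q_12)):
    print(name, {h: round(v, 3) for h, v in qs.items()})
```

Because every hypothesis is now measured against the same distribution $F$, the more restrictive hypothesis always receives the smaller q-value, in contrast with the p-values of the two examples.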

Notice that a p-value would be computed by $p = 1 - F_{H_0}(2(\ell(\hat{\theta}) - \ell(\hat{\theta}_0)))$, where $F_{H_0}$ is the (asymptotic) distribution of $2(\ell(\hat{\theta}) - \ell(\hat{\theta}_0))$, which depends on $H_0$. On the other hand, $F$ is the (asymptotic) distribution of $2(\ell(\hat{\theta}) - \ell(\theta_0))$, which does not depend on $H_0$, where $\theta_0$ is the true value. Now, by Equation (1), we find a duality between q-values and p-values, which is made explicit in Lemma 3.3.

Lemma 3.3. Assume that conditions C1 and C2 hold. If $F$ and $F_{H_0}$ are strictly increasing functions, then the following equalities hold:

$$p = 1 - F_{H_0}(F^{-1}(1 - q)) \quad \text{and} \quad q = 1 - F(F_{H_0}^{-1}(1 - p)). \qquad (2)$$

Note that, under general regularity conditions on the likelihood function and considering that $\Theta_0$ is a smooth semi-algebraic subset of $\Theta$, all asymptotics for the q-value can be derived by using the duality relation presented above in Equation (2). In usual frequentist significance tests, the type I error probability characterizes the proportion of cases in which a null hypothesis $H_0$ would be rejected when it is true in a hypothetical long run of repeated sampling. On the one hand, as a p-value usually has a uniform distribution under $H_0$, the probability of obtaining a p-value smaller than $\alpha$ is $\alpha$. On the other hand, we can only guarantee a uniform distribution for the evidence value $q$ under the simple null hypothesis $\Theta_0 = \{\theta_0\}$, which specifies all parameters involved in the model. Indeed, it can be readily seen that if $\Theta_0 = \{\theta_0\}$, then $F = F_{H_0}$ (an asymptotic chi-square with $k$ degrees of freedom) and therefore $q \sim U(0, 1)$ (at least asymptotically). However, if $\Theta_0$ has dimension smaller than $k$, e.g., under curvature of parameters, the distribution $F_{H_0}$ would differ from $F$. Notice that the threshold value $\alpha$ adopted for p-values is not valid for q-values; the actual threshold for q-values should be computed using relation (2), i.e., $\alpha_{H_0} = 1 - F(F_{H_0}^{-1}(1 - \alpha))$ would be the new cut-off. Of course, if the decision were based on this actual threshold, the same logical contradictions would arise. We remark that we do not just shift the problem to another level: q-values are logically consistent measures of evidence, and critical values may be derived in the light of loss functions or other procedures. This cannot be done without committing logical contradictions for p-values, since they are not logically consistent measures of evidence for null hypotheses (see next section).
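The duality in relation (2) is easy to evaluate when both distributions are chi-squares. The following sketch assumes a two-dimensional parameter space ($F$ chi-square with 2 degrees of freedom) and a null hypothesis of co-dimension 1 ($F_{H_0}$ chi-square with 1 degree of freedom), as in $H_{02}$ of Example 1.1; the function names are ours, and the chi-square quantile with 1 degree of freedom is obtained from the standard normal quantile in the Python standard library.

```python
import math
from statistics import NormalDist

def chi2_1_quantile(u):
    """Quantile of chi-square with 1 df: square of a standard normal quantile."""
    return NormalDist().inv_cdf((1.0 + u) / 2.0) ** 2

def p_to_q(p):
    """q = 1 - F(F_H0^{-1}(1 - p)) with F_H0 = chi2_1 and F = chi2_2."""
    t = chi2_1_quantile(1.0 - p)
    return math.exp(-t / 2.0)          # 1 - F(t) for chi-square with 2 df

def q_to_p(q):
    """p = 1 - F_H0(F^{-1}(1 - q)) with the same two distributions."""
    t = -2.0 * math.log(q)             # F^{-1}(1 - q) for chi-square with 2 df
    return math.erfc(math.sqrt(t / 2.0))

alpha = 0.05
alpha_q = p_to_q(alpha)                # cut-off on the q-value scale
print(round(alpha_q, 4))
```

Under these assumptions, a p-value threshold of 0.05 corresponds to a q-value cut-off of roughly 0.147, which shows why the usual $\alpha$ cannot be reused unchanged for q-values.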



Pereira et al. (2008) left a challenge to the reader, namely, to obtain the one-to-one relationship between the evidence value computed via the FBST (e-value) and p-values. Diniz et al. (2012) showed that asymptotically the answer is given in Lemma 3.3, replacing the q-value with the e-value; therefore, q-values and e-values are asymptotically equivalent.



4 Q-value as an objective state of belief



In this section we analyze the definition of q-values in the light of the Abstract Belief Calculus (ABC) proposed by Darwiche and Ginsberg (1992). ABC is a symbolic generalization of probabilities. Probability is a function defined over a family of subsets (known as a $\sigma$-field) of a main nonempty set $\Omega$ to the interval $[0, 1]$. Additivity is the main characteristic of this function of subsets, that is, if $A$ and $B$ are disjoint measurable subsets, then the probability of the union $A \cup B$ is the sum of their respective probabilities. Basically, all theorems of probability calculus require this additive property. Cox (1946) derived this sum rule from a set of more fundamental axioms (see also Jaynes, 1957; Aczél, 2004), which tries to consider the following assertions:

"... the less likely is an event to occur the more likely it is not to occur. The occurrence of both of two events will not be more likely and will generally be less likely than the occurrence of the less likely of the two. But the occurrence of at least one of the events is not less likely and is generally more likely than the occurrence of either" (Cox, 1946).



The quotation above is alleged to be a fundamental part of any coherent reasoning and, based on it, some scholars claim that degrees of belief should be manipulated according to the laws of probability theory (Cox, 1946; Jaynes, 1957; Caticha, 2009). The word "likely" could be replaced by "probable", "possible", "plausible" or any other that represents a measure of our belief or (un)certainty under limited knowledge. Dubois and Prade (2001) said that, under limited knowledge,



"one agent that does not believe in a proposition does NOT imply the 
(s)he believes in its negation" 

and also that 

"uncertainty in propositional logic is ternary and not binary: either a 
proposition is believed, or its negation is believed, or neither of them 
are believed" . 

Therefore, under a limited knowledge, the claim that "the less possible is an event 
to occur the more possible it is not to occur" is too restrictive to be of universal 
applicability in general beliefs. 

Cox's derivation is made through associativity functional equations (let $G : \mathbb{R}^2 \to \mathbb{R}$ be a real function with two real arguments; then $G(x, G(y, z)) = G(G(x, y), z)$ is the associativity functional equation) that represent the second sentence of the above quotation; the function $G$ involved is considered continuous and strictly increasing in both its arguments. However, when this function is merely non-decreasing we also have a coherent reasoning (see Darwiche and Ginsberg, 1992), and rules other than the additive one emerge from this functional equation, such as minimum (maximum), as demonstrated by Marichal (2000). Therefore, probability is not the unique coherent way of dealing with uncertainties, as is usually thought and spread among some scholars. Moreover, Dubois and Prade (2001) clarify that probability theory is not a faithful representation of incomplete knowledge in the sense of classical logic, as is usually considered.
In our case, the set on which we want to define an objective state of belief is the parametric space $\Theta$. Here, the word "objective" means that no prior distributions over $\Theta$ are specified. Naturally, there exists some level of subjective knowledge in the choice of models, parametric space and so on; these sources of subjectivity will not be discussed further. It is widely known that the frequentist school admits no probability distributions over the subsets of $\Theta$ and that probability distributions are only assigned to observable randomized events. Here, we show that the proposed q-value can prescribe an objective state of belief over the subsets of $\Theta$ without assigning any prior subjective probability distributions over the subsets of $\Theta$. It is quite obvious that a q-value is not a probability measure, since it is not additive. However, as we shall see in this section, q-values are indeed abstract states of belief.

Abstract belief calculus (ABC) is built considering more basic axioms than
those regarded in Cox's derivation and, as a consequence, the additive property
must be generalized to a summation operator $\oplus$ such that the usual sum
rule is a simple particular case (Darwiche and Ginsberg also deal with
propositions instead of sets, but here we consider that propositions are
represented explicitly as sets). For the sake of completeness, we expose the
main components of their theory in what follows.
Firstly, let $\mathcal{L}$ be a family of subsets closed under unions,
intersections and complements. The ABC starts by defining a function
$\Phi : \mathcal{L} \to \mathcal{S}$ called the support function, and for each
$A \in \mathcal{L}$, $\Phi(A)$ is called the support value of $A$. Then, in
order to define the properties of this support function, a partial support
structure $\langle \mathcal{S}, \oplus \rangle$ is defined such that the support
summation $\oplus : \mathcal{S} \times \mathcal{S} \to \mathcal{S}$ satisfies
the following properties:

• Symmetry: $a \oplus b = b \oplus a$ for every $a, b \in \mathcal{S}$

• Associativity: $(a \oplus b) \oplus c = a \oplus (b \oplus c)$ for every $a, b, c \in \mathcal{S}$

• Convexity: for every $a, b, c \in \mathcal{S}$ such that $(a \oplus b) \oplus c = a$, then also $a \oplus b = a$

• Zero element: there exists a unique $0 \in \mathcal{S}$ such that $a \oplus 0 = a$ for all $a \in \mathcal{S}$

• Unit element: there exists a unique element $1 \in \mathcal{S}$, where $1 \neq 0$, such that
for each $a \in \mathcal{S}$ there exists $b \in \mathcal{S}$ such that $a \oplus b = 1$

Then, the properties of $\Phi$ are the following:

1. For $A, B \in \mathcal{L}$ such that $A \subseteq B$ and $B \subseteq A$, then $\Phi(A) = \Phi(B)$

2. For $A, B \in \mathcal{L}$ such that $A \cap B = \emptyset$, then $\Phi(A \cup B) = \Phi(A) \oplus \Phi(B)$

3. For $A, B, C \in \mathcal{L}$ such that $A \subseteq B \subseteq C$ and $\Phi(A) = \Phi(C)$, then $\Phi(A) = \Phi(B)$

4. $\Phi(\emptyset) = 0$ and $\Phi(\Theta) = 1$

Returning to our proposal and taking $\mathcal{S} = [0, 1]$, $\oplus = \sup$,
$0 = 0$, $1 = 1$ and $\Phi = q$ we see, by Theorems 3.1 and 3.2, that the
q-value satisfies all the properties above. Therefore, for our problem, we
identify that our q-value is acting precisely as a support function $\Phi$ in
the ABC formalism. It is noteworthy that the support value of $A$ does not
determine the support value of $A^c$ (the complement of $A$). This
determination happens in the probability calculus since, when
$\mathcal{S} = [0, 1]$ and $\oplus = +$, the probabilities of sets that form a
partition of the total space must sum up to one. Then, for abstract support
functions, Darwiche and Ginsberg (1992) defined the degree of belief function
$\ddot{\Phi} : \mathcal{L} \to \mathcal{S} \times \mathcal{S}$, such that
$\ddot{\Phi}(A) = (\Phi(A), \Phi(A^c))$. The value of $\ddot{\Phi}(A)$ is said
to be the degree of belief of $A$. Naturally, for the probability calculus this
function is vacuous, since if $\mathcal{S} = [0, 1]$, $\oplus = +$ and $\Phi$ is
a probability function then, for any measurable subset $A$,
$\ddot{\Phi}(A) = (\Phi(A), 1 - \Phi(A))$.
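
The axioms above can be checked numerically for the possibilistic choice $\oplus = \max$. The following is only a minimal sketch: the finite grid standing in for $[0, 1]$, the four-point parameter set and the values in `PI` are hypothetical, and a grid check illustrates the axioms rather than proving them (uniqueness of the zero and unit elements is not checked).

```python
from itertools import product

# Finite grid standing in for S = [0, 1] (an illustrative assumption).
GRID = [i / 10 for i in range(11)]


def oplus(a, b):
    """Support summation in the possibilistic case: max(a, b)."""
    return max(a, b)


def check_partial_support_structure(grid, op, zero=0.0, one=1.0):
    """Check the axioms of a partial support structure on a finite grid."""
    pairs = list(product(grid, repeat=2))
    triples = list(product(grid, repeat=3))
    return {
        "symmetry": all(op(a, b) == op(b, a) for a, b in pairs),
        "associativity": all(
            op(op(a, b), c) == op(a, op(b, c)) for a, b, c in triples
        ),
        "convexity": all(
            op(a, b) == a for a, b, c in triples if op(op(a, b), c) == a
        ),
        "zero": all(op(a, zero) == a for a in grid),
        "unit": all(any(op(a, b) == one for b in grid) for a in grid),
    }


# Toy possibility values on a four-point parameter set (hypothetical numbers).
PI = {"t1": 1.0, "t2": 0.6, "t3": 0.3, "t4": 0.0}


def support(subset):
    """Support value of a subset: sup of PI over it, 0 for the empty set."""
    return max((PI[t] for t in subset), default=0.0)


def degree_of_belief(subset):
    """Pair (support of A, support of the complement of A)."""
    complement = set(PI) - set(subset)
    return (support(subset), support(complement))


axioms = check_partial_support_structure(GRID, oplus)
```

In this toy setting `degree_of_belief({"t1"})` and `degree_of_belief({"t1", "t2"})` share the same first component but differ in the second, which illustrates that the support of a set does not determine the support of its complement.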

Now, let $a, b \in \mathcal{S}$; if there exists $c \in \mathcal{S}$ such that
$a \oplus c = b$, we say that the support value $a$ is no greater than the
support value $b$ and we use the notation $a \preceq_\oplus b$; the symbol
$\preceq_\oplus$ is called the support order. Darwiche and Ginsberg (1992)
showed that $\preceq_\oplus$ is a partial order under which $0$ is minimal and
$1$ is maximal. Also, let $\ddot{\Phi}(A) = (a_1, a_2)$ and
$\ddot{\Phi}(B) = (b_1, b_2)$ be the degrees of belief of $A$ and $B$,
respectively. We say that the degree of belief of $A$ is no greater than the
degree of belief of $B$ if $a_1 \preceq_\oplus b_1$ and $b_2 \preceq_\oplus a_2$;
this is represented by $\ddot{\Phi}(A) \sqsubseteq_\oplus \ddot{\Phi}(B)$.
Darwiche and Ginsberg (1992) also showed that $\sqsubseteq_\oplus$ is a partial
order under which $(0, 1)$ is minimal and $(1, 0)$ is maximal. In this coherent
framework, $\ddot{\Phi}(A) = (1, 1)$ is fully possible without any
contradiction (of course, this cannot happen for probability measures). The
authors clarified this in the quotation below:

"A sentence is rejected precisely when it is supported to degree 0. And 
a sentence is accepted only if it is supported to degree 1. But if the 
sentence is supported to degree 1, it is not necessarily accepted. For 
example, when degrees of support are S = {possible, impossible}, a
sentence and its negation could be possible. Here, both the sentence
and its negation are supported to degree possible, but neither is
accepted."

As our q-value is a support function for the subsets of $\Theta$, we may use
this measure to state degrees of belief for the subsets of $\Theta$. Notice that
we cannot do this with the usual concept of p-value, for the following reason.
If $A \subseteq B \subseteq \Theta$, then $A$ and $A^c \cap B$ are disjoint
sets; therefore, as $B = A \cup (A^c \cap B)$, we have by Property 2 that
$\Phi(B) = \Phi(A) \oplus \Phi(A^c \cap B)$ and then
$\Phi(A) \preceq_\oplus \Phi(B)$, that is, if $A$ is a subset of $B$ the support
value $\Phi(A)$ is no greater than the support value $\Phi(B)$. By our Examples
1.1 and 1.2 we see that p-values do not satisfy this requirement.
As aforementioned, q-values are not probability measures on the subsets of
$\Theta$ and this is not a weakness as some may argue (for subsets of $\Theta$
with dimensions smaller than $\Theta$, the best measure of evidence that
probability measures can provide is zero). We remark that p-values and the
Bayesian e-values (Pereira and Stern, 1999) are not probability measures on
$\Theta$; moreover, usually defined p-values cannot even be included in the ABC
formalism to establish objective states of belief on the subsets of $\Theta$.
Hence, q-values can be used to fill this gap.

Consider the null hypothesis $H_0 : \theta \in \Theta_0$. By Condition C1, we
know that $\hat{\theta} \in \Theta_0$ or $\hat{\theta} \in \Theta_0^c$, then
$q(\Theta_0) = 1$ or $q(\Theta_0^c) = 1$. As the q-value is a measure of support
and $(0, 1)$ is minimal and $(1, 0)$ is maximal, we can readily reject $H_0$
provided that $\ddot{\Phi}(\Theta_0) = (q(\Theta_0), q(\Theta_0^c)) = (0, 1)$
and readily accept $H_0$ provided that
$\ddot{\Phi}(\Theta_0) = (q(\Theta_0), q(\Theta_0^c)) = (1, 0)$. In these two
cases we have complete knowledge. When
$\ddot{\Phi}(\Theta_0) = (q(\Theta_0), q(\Theta_0^c)) = (1, 1)$ we have complete
ignorance regarding this specific hypothesis and can neither accept nor reject
it; we must then perform another experiment to gather more information. In
general, we have intermediate states of knowledge, namely: (1)
$\ddot{\Phi}(\Theta_0) = (a, 1)$ or (2) $\ddot{\Phi}(\Theta_0) = (1, b)$, where
$a, b \in [0, 1]$. In the first case, we say that there is evidence against
$H_0$ if $a$ is sufficiently small (i.e., $a < a_c$ for some critical value
$a_c$) and we say that the decision is unknown whenever $a$ is not sufficiently
small. In the second case, we say that there is evidence in favor of $H_0$ if
$b$ is sufficiently small (i.e., $b < b_c$ for some critical value $b_c$) and we
say that the decision is unknown whenever $b$ is not sufficiently small. The
problem now is to define what is "sufficiently small" to perform a decision.
The decision rule may be derived through loss functions or another procedure;
we will study this issue in future works. Whatever the chosen procedure, it
should respect the minimal ($\ddot{\Phi}(\Theta_0) = (0, 1)$), maximal
($\ddot{\Phi}(\Theta_0) = (1, 0)$) and inconclusive
($\ddot{\Phi}(\Theta_0) = (1, 1)$) features of possibilistic measures (we stress
that p-values have a similar behaviour, since they are almost possibilistic
measures).
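
The states of knowledge described in this paragraph can be summarised as a small classification routine. This is only a sketch: the function name `classify` and the threshold arguments `a_c` and `b_c` are hypothetical, and the paper leaves the choice of critical values open, so any numbers passed in are placeholders rather than recommendations.

```python
def classify(q_null, q_alt, a_c, b_c):
    """Classify the degree of belief (q(Theta_0), q(Theta_0^c)).

    a_c and b_c are user-supplied critical values (hypothetical here;
    the text does not prescribe how to choose them).
    """
    # Condition C1 implies that at least one of the two q-values equals one.
    if max(q_null, q_alt) != 1.0:
        raise ValueError("expected q_null == 1 or q_alt == 1")
    if (q_null, q_alt) == (0.0, 1.0):
        return "reject H0"                # minimal degree of belief
    if (q_null, q_alt) == (1.0, 0.0):
        return "accept H0"                # maximal degree of belief
    if (q_null, q_alt) == (1.0, 1.0):
        return "complete ignorance"       # inconclusive state
    if q_alt == 1.0:
        # Case (1): degree of belief (a, 1).
        return "evidence against H0" if q_null < a_c else "unknown"
    # Case (2): degree of belief (1, b).
    return "evidence in favor of H0" if q_alt < b_c else "unknown"
```

For instance, with placeholder thresholds `a_c = b_c = 0.05`, the pair `(0.01, 1.0)` is classified as evidence against the null hypothesis, whereas `(0.4, 1.0)` leaves the decision unknown.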

Notice that if the null and alternative hypotheses ($H_0 : \theta \in \Theta_0$
and $H_1 : \theta \in \Theta_0^c$, respectively) are fixed before observing the
data, it is possible to obtain a maximum likelihood estimate that lies in the
null parametric space, $\hat{\theta} \in \Theta_0$. For instance, testing the
mean of a normal distribution, $H_0 : \mu \leq 0$ against $H_1 : \mu > 0$, we
may eventually observe a negative sample average, $\bar{X} < 0$; then the
respective p-value computed by using the likelihood ratio statistic will equal
one. This happens in general provided $\hat{\theta} \in \Theta_0$. In practice,
practitioners formulate a hypothesis suggested by the data to avoid such a
feature (i.e., in the mean testing of the normal distribution: if
$\bar{X} > 0$, then "$H_0 : \mu \leq 0$ against $H_1 : \mu > 0$"; if
$\bar{X} < 0$, then "$H_0 : \mu \geq 0$ against $H_1 : \mu < 0$"; the null and
alternative hypotheses are interchanged with modifications for the null
parametric space being closed). If the maximum likelihood estimate lies on the
border of the null parametric space, then the respective p-value will generally
equal one for both cases. For instance, in the normal example, if $\bar{X} = 0$
the respective p-values will be one for both cases: "$H_0 : \mu \leq 0$ against
$H_1 : \mu > 0$" and "$H_0 : \mu \geq 0$ against $H_1 : \mu < 0$". As q-values
are support measures, we do not need to interchange the null and alternative
hypotheses; all we have to do is to analyze the hypothesis according to the
rules of possibilistic measures stated in the previous paragraph.
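
As a rough numerical illustration of the normal-mean case, the sketch below computes the pair $(q(\Theta_0), q(\Theta_0^c))$ for $H_0 : \mu \leq 0$. It rests on assumptions that are not stated in the text: the variance is taken as known (`sigma`), the statistic is $n(\bar{X} - \mu)^2/\sigma^2$, and its reference distribution is chi-square with one degree of freedom. The function names and the sample values used below are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def chi2_1_sf(t):
    """Survival function of a chi-square with 1 degree of freedom."""
    return 2.0 * (1.0 - NormalDist().cdf(sqrt(t)))


def degree_of_belief_mean(xbar, n, sigma=1.0):
    """Return (q(Theta_0), q(Theta_0^c)) for H0: mu <= 0 (known sigma assumed).

    The statistic n * (xbar - mu)^2 / sigma^2 is minimised over each
    hypothesis: at mu = xbar when xbar belongs to the hypothesis, and at
    (or arbitrarily close to) the border mu = 0 otherwise.
    """
    t_null = n * max(xbar, 0.0) ** 2 / sigma ** 2
    t_alt = n * min(xbar, 0.0) ** 2 / sigma ** 2
    return (chi2_1_sf(t_null), chi2_1_sf(t_alt))


# Hypothetical data: negative sample average, then a sample average on the border.
belief_negative = degree_of_belief_mean(-0.3, 25)
belief_border = degree_of_belief_mean(0.0, 25)
```

With a negative sample average the first component equals one and the second is below one, so the analysis proceeds through case (2) of the previous paragraph without interchanging the hypotheses; on the border both components equal one, which is the inconclusive state.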



5 Examples 

In this section we apply our proposal to Examples 1.1 and 1.2 and we also
consider an example for the Hardy-Weinberg equilibrium hypothesis.



Example 5.1. Consider Example 1.1. After a straightforward computation we find

$$2(\ell(\hat{\theta}) - \ell(\theta)) = n(\bar{X} - \mu)^\top (\bar{X} - \mu),$$

where $\bar{X} = (\bar{X}_1, \bar{X}_2)^\top$ and $\mu = (\mu_1, \mu_2)^\top$,
and $F_\alpha$ is the $\alpha$ quantile from a chi-square distribution with two
degrees of freedom. Then, the q-values for $H_{01} : \mu_1 = \mu_2 = 0$ and
$H_{02} : \mu_1 = \mu_2$ are respectively $q_1 = 0.104$ and $q_2 = 0.105$. Note
that the curve $\mu_1 = \mu_2$ intercepts $\Lambda_{q_2}$ at $(-0.01, -0.01)$.
As expected for this case, $q_1 < q_2$, since $\Theta_{01} \subset \Theta_{02}$;
in addition, $q_1$ and $q_2$ are near each other because the variables are
independent. If the variables were correlated, those q-values would differ
drastically ($q_2$ being always greater than $q_1$).
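
The two q-values of Example 5.1 can be reproduced with a few lines of arithmetic. The inputs below are assumptions, since this section does not restate the data of Example 1.1: a sample size of 100 and a sample mean of (0.14, -0.16) are used because they are consistent with the reported intercept (-0.01, -0.01) and with the reported q-values. The sketch also assumes the chi-square reference distributions named in the comments, and the likelihood ratio p-value for the second hypothesis is computed under the assumption of one degree of freedom.

```python
from math import exp, sqrt
from statistics import NormalDist

n = 100                 # assumed sample size (not stated in this section)
xbar = (0.14, -0.16)    # assumed sample mean (not stated in this section)


def chi2_2_sf(t):
    """Survival function of a chi-square with 2 degrees of freedom."""
    return exp(-t / 2.0)


def chi2_1_sf(t):
    """Survival function of a chi-square with 1 degree of freedom."""
    return 2.0 * (1.0 - NormalDist().cdf(sqrt(t)))


# H01: mu1 = mu2 = 0, so the statistic is evaluated at the origin.
t01 = n * (xbar[0] ** 2 + xbar[1] ** 2)

# H02: mu1 = mu2, so the statistic is minimised at the projection of xbar
# onto the line mu1 = mu2.
m = (xbar[0] + xbar[1]) / 2.0
t02 = n * ((xbar[0] - m) ** 2 + (xbar[1] - m) ** 2)

# q-values use the chi-square with 2 degrees of freedom for both hypotheses.
q1, q2 = chi2_2_sf(t01), chi2_2_sf(t02)

# Likelihood ratio p-values: 2 degrees of freedom for H01, 1 for H02 (assumed).
p1, p2 = chi2_2_sf(t01), chi2_1_sf(t02)
```

Under these assumptions the q-values are ordered as the nesting requires (`q1 <= q2`), whereas the p-values are not (`p2 < p1`), which is the kind of conflict the q-value is meant to avoid.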

Example 5.2. Consider Example 1.2. The maximum likelihood estimates for $b_1$
and $b_2$ are respectively $\hat{b}_1 = 0.1966$ and $\hat{b}_2 = -0.1821$. Here,
we find that

$$2(\ell(\hat{\theta}) - \ell(\theta)) = (\hat{b} - b)^\top (x^\top x)(\hat{b} - b).$$

Then, $F_\alpha$ is the $\alpha$ quantile from a chi-square distribution with
two degrees of freedom and the q-values for $H_{01} : b_1 = 0$,
$H_{02} : b_2 = 0$ and $H_{03} : (b_1, b_2) = (0, 0)$ are respectively
$q_1 = 0.107$, $q_2 = 0.135$ and $q_3 = 0.101$. Therefore, as expected by the
logical reasoning, $q_1 > q_3$ and $q_2 > q_3$.

We also compare our results with the FBST approach considering a trinomial
distribution and the Hardy-Weinberg equilibrium hypothesis.

Example 5.3. Consider that we observe a vector of three values $x_1, x_2, x_3$,
in which the likelihood function is proportional to
$\theta_1^{x_1} \theta_2^{x_2} \theta_3^{x_3}$, where $x_1 + x_2 + x_3 = n$ and
the parametric space is
$\Theta = \{\theta \in (0, 1)^3 : \theta_1 + \theta_2 + \theta_3 = 1\}$. Here,
we use the same settings described in Section 4.3 by Pereira and Stern (1999),
that is, the null hypothesis is
$\Theta_0 = \{\theta \in \Theta : \theta_3 = (1 - \sqrt{\theta_1})^2\}$ and
$n = 20$.

Table 1 presents the q-values for all values of $x_1$ and $x_3$. The last two
columns were taken from Table 2 of Pereira and Stern (1999). It should be said
that we computed the q-values by using Definition 2.3 instead of Relation (2),
because the p-values were presented with two decimal places in Pereira and
Stern (1999) and this can induce distorted q-values. As can be seen, our
proposal yields similar results to the FBST approach.
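
The q-values in the Hardy-Weinberg example admit a closed-form computation, sketched below under assumptions of our own: the reference distribution is taken to be chi-square with two degrees of freedom (so that one minus its distribution function at $t$ equals $e^{-t/2}$), and the restricted maximum is obtained from the standard closed form $(2x_1 + x_2)/(2n)$ for $\sqrt{\theta_1}$ under the null hypothesis. The function names are hypothetical.

```python
from math import exp, log


def xlogy(x, y):
    """Return x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * log(y)


def loglik(x, theta):
    """Trinomial log-likelihood up to an additive constant."""
    return sum(xlogy(xi, ti) for xi, ti in zip(x, theta))


def q_value_hardy_weinberg(x1, x3, n=20):
    """q-value for Theta_0 = {theta_3 = (1 - sqrt(theta_1))^2}.

    Assumes a chi-square reference distribution with 2 degrees of freedom.
    """
    x2 = n - x1 - x3
    x = (x1, x2, x3)
    theta_hat = tuple(xi / n for xi in x)        # unrestricted estimate
    p = (2 * x1 + x2) / (2 * n)                  # sqrt(theta_1) under H0
    theta_null = (p * p, 2 * p * (1 - p), (1 - p) ** 2)
    t_min = 2.0 * (loglik(x, theta_hat) - loglik(x, theta_null))
    return exp(-max(t_min, 0.0) / 2.0)
```

Rounded to two decimal places, this sketch gives 0.33 for $x_1 = 1$, $x_3 = 7$ and 0.13 for $x_1 = 1$, $x_3 = 18$, in agreement with the corresponding rows of Table 1.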



6 Discussion and final remarks 



Berger and Sellke (1987) compare p-values with posterior probabilities (by
using objective prior distributions) and find differences by an order of
magnitude (when testing a normal mean, data may produce a p-value of 0.05 and a
posterior probability of the null hypothesis of at least 0.30). As opposed to
posterior distributions, p-values do not hold the requirement of evidence
measures, but one can also conclude that posterior probabilities cannot be used
to reflect probabilities in a hypothetical long run of repeated sampling. Also,
posterior probabilities cannot provide a measure of evidence different from zero
under sharp hypotheses (when the dimension of the null parametric space is
smaller than that of the full parametric space). A Bayesian procedure that
provides a positive evidence measure under sharp hypotheses is the FBST. In this
context, q-values are directly comparable with the evidence measures of the FBST
approach. This latter procedure needs numerical integrations and maximizations,
which may be difficult to attain for high dimensional problems. As we studied in
previous sections, our procedure produces similar results to the FBST and can be
readily used as a classical alternative (when the user does not want to specify
prior distributions). Moreover, if one has a p-value computed via the likelihood
ratio statistic, then Relation (2) may be applied to compute the respective
q-value without any further computational procedures (maximizations and
integrations) and, also, this relation allows one to derive the q-value
distribution (if desired). Also, we should mention that q-values, when computed
using the asymptotic distribution of $2(\ell(\hat{\theta}) - \ell(\theta_0))$,
do respect the famous likelihood principle, but we should say that it is not the
main concern here; it is just a property of our approach. Naturally, if the
exact distribution of $2(\ell(\hat{\theta}) - \ell(\theta_0))$ is adopted, the
likelihood principle may be violated.

When the null hypothesis is simple and specifies the full vector of parameters,
say $\Theta_0 = \{\theta_0\}$, the proposed q-values are, in general, p-values.
Otherwise, q-values cannot be interpreted as p-values; instead, they must be
treated as measures of evidence for null hypotheses. As aforementioned, when
treated as evidence measures, p-values have some internal undesirable features
(in some cases, for nested hypotheses $H_{01}$ and $H_{02}$, where $H_{01}$ is
nested within $H_{02}$, p-values might give more evidence against $H_{02}$ than
against $H_{01}$). On the other hand, p-values respect the repeated sampling
principle, that is, in the long run the average actual error of rejecting a true
hypothesis is not greater than the reported error. In other words, as p-values
have a uniform distribution under the null hypothesis, the frequency of
observing p-values smaller than $\alpha$ is $\alpha$. This is an external
desirable aspect, since this allows us to verify model assumptions and
sensibility, among many other things. The proposed q-values overcome that
internal undesirable aspect of p-values, but the problem now is how to evaluate
a critical value to establish a decision rule for a hypothesis based on q-values
(this decision should respect the rules of possibilistic measures). If we want
to respect the repeated sampling principle, based on Lemma 3.3 we see that a
critical value for $q$ depends on $H_0$. To see that, let $\alpha$ be the chosen
critical value for the computed p-value, then it can be "corrected" to
$\alpha'_{H_0} = 1 - F(F_{H_0}^{-1}(1 - \alpha))$ for the respective q-value.
This threshold value $\alpha'_{H_0}$ will respect the repeated sampling
principle if and only if it varies with $H_0$. If we adopt this "corrected"
critical value we will have the same internal undesirable features of p-values.
We must rely on other principles to compute the threshold value for our q-value,
maybe based on loss functions. These loss functions may incorporate the
scientific importance of a hypothesis to elaborate a reasonable critical value
(this issue will be discussed in future work). It is well known that statistical
significance is not the same as scientific significance; for a further
discussion we refer the reader to Cox (1977). Naturally, we could also employ
loss functions on p-values to find a threshold; however, the internal
undesirable features of p-values will certainly bring problems to implement this
without any logical conflicts.
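
To make the dependence of the "corrected" threshold on the null hypothesis concrete, the sketch below evaluates $1 - F(F_{H_0}^{-1}(1 - \alpha))$ in a two-parameter setting. It assumes, as an illustration only, that $F$ is the chi-square distribution function with two degrees of freedom and that $F_{H_0}$ is chi-square with as many degrees of freedom as restrictions imposed by the null hypothesis (one or two); the function name and its arguments are hypothetical.

```python
from math import exp, log
from statistics import NormalDist


def corrected_threshold(alpha, null_restrictions):
    """Corrected critical value for the q-value with two parameters.

    F is taken as chi-square with 2 degrees of freedom and F_H0 as chi-square
    with `null_restrictions` degrees of freedom (both are assumptions).
    """
    if null_restrictions == 2:
        # Chi-square(2) quantile at level 1 - alpha.
        quantile = -2.0 * log(alpha)
    elif null_restrictions == 1:
        # Chi-square(1) quantile: square of the normal quantile at 1 - alpha/2.
        quantile = NormalDist().inv_cdf(1.0 - alpha / 2.0) ** 2
    else:
        raise ValueError("this sketch only covers 1 or 2 restrictions")
    # One minus the chi-square(2) distribution function at the quantile.
    return exp(-quantile / 2.0)
```

Under these assumptions a p-value threshold of 0.05 is unchanged for a simple null hypothesis (two restrictions) but becomes roughly 0.147 for a null hypothesis with a single restriction, so the corrected value does vary with the hypothesis.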

There are many open issues that need more attention regarding q-values. Next
we provide a list of open problems that we did not deal with in this article,
but that will be the subject of our future research.

1. To give a rigorous mathematical treatment when the log-likelihood function
$\ell$ is not strictly concave.

2. To derive a computational procedure to find q-values and their distribution
for (semi)algebraic subsets $\Theta_0$ and not strictly concave $\ell$.

3. To compare theoretical properties of q-values by using other types of
confidence regions. Monte Carlo simulations may be required.

4. To compare q-values with evidence values (e-values) computed via FBST
(Pereira and Stern, 1999) and other procedures such as the posterior Bayes
factor (Aitkin, 1991) in a variety of models by using actual data.

5. To derive a criterion to advise one out of three decisions, "acceptance",
"rejection" or "undecidable", of a null hypothesis without having any types of
conflict.
We end this paper by saying that we are not advocating a replacement of
p-values by q-values. Instead, we just recommend q-values as additional measures
to assist data analysis.

Figure 1: Borders of confidence regions $\Lambda_\alpha$, for different values
of $\alpha$. The dotted line is $\Lambda_{q_1}$, where $q_1$ is the q-value for
testing $H_{01} : \theta_1 = 0$. The dot-dashed line is $\Lambda_{q_2}$, where
$q_2$ is the q-value for testing $H_{02} : \theta_2 = 0$. The dashed line is
$\Lambda_{q_3}$, where $q_3$ is the q-value for testing
$H_{03} : \theta_1^2 + \theta_2^2 = 1$.

Table 1: Tests of Hardy-Weinberg equilibrium

x1   x3   q-value   e-value (FBST)   p-value
 1    2    0.00      0.01             0.00
 1    3    0.02      0.01             0.01
 1    4    0.04      0.04             0.02
 1    5    0.10      0.09             0.04
 1    6    0.20      0.18             0.08
 1    7    0.33      0.31             0.15
 1    8    0.50      0.48             0.26
 1    9    0.68      0.66             0.39
 1   10    0.84      0.83             0.57
 1   11    0.95      0.95             0.77
 1   12    1.00      1.00             0.99
 1   13    0.96      0.96             0.78
 1   14    0.85      0.84             0.55
 1   15    0.68      0.66             0.33
 1   16    0.48      0.47             0.16
 1   17    0.29      0.27             0.05
 1   18    0.13      0.12             0.00
 5    0    0.01      0.02             0.01
 5    1    0.10      0.09             0.04
 5    2    0.32      0.29             0.14
 5    3    0.63      0.61             0.34
 5    4    0.90      0.89             0.65
 5         1.00      1.00             1.00
           0.91      0.90             0.66
 5         0.69      0.66             0.39
 5